2025-05-20-12-13
Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions
Abstract
arXiv:2505.11614v1 Announce Type: new Abstract: A central goal of cognitive modeling is to develop models that not only predict human behavior but also provide insight into the underlying cognitive mechanisms. While neural network models trained on large-scale behavioral data often achieve strong predictive performance, they typically fall short in offering interpretable explanations of the cognitive processes they capture. In this work, we explore the potential of pretrained large language models (LLMs) to serve as dual-purpose cognitive models--capable of both accurate prediction and interpretable explanation in natural language. Specifically, we employ reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces for explaining human risky choices. Our findings demonstrate that this approach produces high-quality explanations alongside strong quantitative predictions of human decisions.
摘要
认知建模的核心目标是开发不仅能预测人类行为、还能揭示潜在认知机制的模型。尽管基于大规模行为数据训练的神经网络模型通常具有强大的预测性能,但它们往往无法对所捕捉的认知过程提供可解释的说明。本研究探索了预训练大语言模型(LLMs)作为双重用途认知模型的潜力——既能实现准确预测,又能以自然语言提供可解释的说明。具体而言,我们采用基于结果奖励的强化学习来引导LLMs生成显式推理轨迹,用以解释人类风险决策。研究结果表明,该方法在提供人类决策强大量化预测的同时,还能产生高质量的解释性说明。
PeerGuard: Defending Multi-Agent Systems Against Backdoor Attacks Through Mutual Reasoning
Abstract
arXiv:2505.11642v1 Announce Type: new Abstract: Multi-agent systems leverage advanced AI models as autonomous agents that interact, cooperate, or compete to complete complex tasks across applications such as robotics and traffic management. Despite their growing importance, safety in multi-agent systems remains largely underexplored, with most research focusing on single AI models rather than interacting agents. This work investigates backdoor vulnerabilities in multi-agent systems and proposes a defense mechanism based on agent interactions. By leveraging reasoning abilities, each agent evaluates responses from others to detect illogical reasoning processes, which indicate poisoned agents. Experiments on LLM-based multi-agent systems, including ChatGPT series and Llama 3, demonstrate the effectiveness of the proposed method, achieving high accuracy in identifying poisoned agents while minimizing false positives on clean agents. We believe this work provides insights into multi-agent system safety and contributes to the development of robust, trustworthy AI interactions.
摘要
多智能体系统利用先进的人工智能模型作为自主智能体,通过交互、合作或竞争完成机器人学和交通管理等应用中的复杂任务。尽管其重要性日益凸显,多智能体系统的安全性研究仍严重不足,现有工作多集中于单一AI模型而非交互智能体。本研究探讨多智能体系统中的后门漏洞,并提出基于智能体交互的防御机制。通过运用推理能力,每个智能体可评估其他智能体的响应以检测异常推理过程,从而识别被污染智能体。在基于大语言模型的多智能体系统(包括ChatGPT系列和Llama 3)上的实验表明,该方法能有效识别被污染智能体且对正常智能体误判率极低。我们相信这项工作为多智能体系统安全研究提供了新视角,有助于发展鲁棒、可信的人工智能交互。
FLOW-BENCH: Towards Conversational Generation of Enterprise Workflows
Abstract
arXiv:2505.11646v1 Announce Type: new Abstract: Business process automation (BPA) that leverages Large Language Models (LLMs) to convert natural language (NL) instructions into structured business process artifacts is becoming a hot research topic. This paper makes two technical contributions -- (i) FLOW-BENCH, a high-quality dataset of paired natural language instructions and structured business process definitions to evaluate NL-based BPA tools and support burgeoning research in this area, and (ii) FLOW-GEN, our approach to utilize LLMs to translate natural language into an intermediate representation with Python syntax that facilitates final conversion into widely adopted business process definition languages, such as BPMN and DMN. We bootstrap FLOW-BENCH by demonstrating how it can be used to evaluate the components of FLOW-GEN across eight LLMs of varying sizes. We hope that FLOW-GEN and FLOW-BENCH catalyze further research in BPA, making it more accessible to novice and expert users.
摘要
利用大型语言模型(LLMs)将自然语言(NL)指令转化为结构化业务流程制品的业务流程自动化(BPA)正成为研究热点。本文提出两项技术贡献:(i)FLOW-BENCH——一个高质量的自然语言指令与结构化业务流程定义配对数据集,用于评估基于NL的BPA工具,并支持该领域新兴研究;(ii)FLOW-GEN——我们提出的方法,通过LLMs将自然语言转换为具有Python语法的中间表示,从而促进最终转化为广泛采用的业务流程定义语言(如BPMN和DMN)。我们通过展示如何利用FLOW-BENCH评估FLOW-GEN在八个不同规模LLMs中的组件性能,实现了该数据集的初始构建。期望FLOW-GEN和FLOW-BENCH能推动BPA领域的进一步研究,使其更易于新手和专家用户使用。
Probing the Vulnerability of Large Language Models to Polysemantic Interventions
Abstract
arXiv:2505.11611v1 Announce Type: new Abstract: Polysemanticity -- where individual neurons encode multiple unrelated features -- is a well-known characteristic of large neural networks and remains a central challenge in the interpretability of language models. At the same time, its implications for model safety are also poorly understood. Leveraging recent advances in sparse autoencoders, we investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small) and evaluate their vulnerability to targeted, covert interventions at the prompt, feature, token, and neuron levels. Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models (LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct). These findings suggest not only the generalizability of the interventions but also point to a stable and transferable polysemantic structure that could potentially persist across architectures and training regimes.
摘要
多义性——即单个神经元编码多个无关特征的现象——是大型神经网络的一个显著特征,也始终是语言模型可解释性研究的核心挑战。与此同时,人们对其在模型安全性方面的影响也知之甚少。借助稀疏自编码器的最新进展,我们研究了两个小型模型(Pythia-70M和GPT-2-Small)的多义性结构,并评估了它们在提示、特征、标记和神经元层面上遭受针对性隐蔽干预的脆弱性。分析揭示了两模型共有的稳定多义性拓扑结构。引人注目的是,我们证明这种结构可被用于对两个更大的黑盒指令微调模型(LLaMA3.1-8B-Instruct和Gemma-2-9B-Instruct)实施有效干预。这些发现不仅表明干预措施具有普适性,更指向了一种稳定且可迁移的多义性结构——这种结构可能在不同架构和训练方案中持续存在。
Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling
Abstract
arXiv:2505.11730v1 Announce Type: new Abstract: Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification, and make the first attempt toward systematically investigating the impact of verification granularity, that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve the compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.
摘要
测试时缩放(TTS)技术已被证明能有效增强大语言模型(LLMs)的推理能力。验证环节在TTS中起着关键作用,其质量与计算成本同时影响着(1)推理性能与(2)计算效率。本研究突破传统验证范式,首次系统探究验证粒度(即验证器在生成过程中被调用的频率,而非仅验证最终输出或单步生成)的影响机制。为此,我们提出可变粒度搜索算法(VG-Search),该统一算法通过可调粒度参数g泛化了束搜索与N最佳采样。在不同计算预算、生成器-验证器配置及任务属性下的实验表明:动态选择g能提升计算效率与缩放性能。基于此,我们提出自适应VG-Search策略,相比束搜索和N最佳采样分别实现最高3.1%和3.6%的准确率提升,同时减少52%以上的浮点运算量。相关代码将开源以支持后续研究。
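The granularity idea in the VG-Search abstract can be illustrated with a toy search loop. This is a minimal sketch under stated assumptions, not the authors' algorithm: the function name `vg_search_sketch`, the parameters `beam_width` and `branch_factor`, and the toy generator and verifier are all hypothetical. The only point it shows is that calling the verifier once every `g` steps behaves like a step-level beam search when `g = 1` and like Best-of-N when `g` equals the full generation length, and that the number of verifier calls shrinks as `g` grows.

```python
import random
from typing import Callable, Tuple

Candidate = Tuple[int, ...]


def vg_search_sketch(
    extend: Callable[[Candidate], int],
    verify: Callable[[Candidate], float],
    total_steps: int,
    g: int,
    beam_width: int,
    branch_factor: int,
) -> Tuple[Candidate, int]:
    """Toy search that invokes the verifier once every g generation steps.

    Returns the best final candidate and the number of verifier calls.
    All names and the pruning rule are illustrative assumptions.
    """
    if g < 1 or total_steps < 1:
        raise ValueError("g and total_steps must be positive")

    # Start with a pool of empty candidates (tuples are immutable, so sharing is safe).
    candidates = [()] * (beam_width * branch_factor)
    verifier_calls = 0
    steps_done = 0

    while True:
        # Generate up to g steps per candidate without consulting the verifier.
        chunk = min(g, total_steps - steps_done)
        extended = []
        for cand in candidates:
            for _ in range(chunk):
                cand = cand + (extend(cand),)
            extended.append(cand)
        steps_done += chunk

        # One verification pass over the whole pool.
        scored = [(verify(c), c) for c in extended]
        verifier_calls += len(scored)
        scored.sort(key=lambda pair: pair[0], reverse=True)

        if steps_done >= total_steps:
            return scored[0][1], verifier_calls

        # Keep the top candidates and branch each one to refill the pool.
        survivors = [c for _, c in scored[:beam_width]]
        candidates = [c for c in survivors for _ in range(branch_factor)]


rng = random.Random(0)
best, calls = vg_search_sketch(
    extend=lambda cand: rng.randint(0, 9),   # toy "generator": one random token
    verify=lambda cand: float(sum(cand)),    # toy "verifier": sum of tokens
    total_steps=6,
    g=3,
    beam_width=2,
    branch_factor=2,
)
print(len(best), calls)
```

With a pool of four candidates and six generation steps, the sketch makes 24 verifier calls at `g = 1`, 8 at `g = 3`, and 4 at `g = 6`, which is the compute trade-off the abstract refers to.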
DMN-Guided Prompting: A Low-Code Framework for Controlling LLM Behavior
Abstract
arXiv:2505.11701v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown considerable potential in automating decision logic within knowledge-intensive processes. However, their effectiveness largely depends on the strategy and quality of prompting. Since decision logic is typically embedded in prompts, it becomes challenging for end users to modify or refine it. Decision Model and Notation (DMN) offers a standardized graphical approach for defining decision logic in a structured, user-friendly manner. This paper introduces a DMN-guided prompting framework that breaks down complex decision logic into smaller, manageable components, guiding LLMs through structured decision pathways. We implemented the framework in a graduate-level course where students submitted assignments. The assignments and DMN models representing feedback instructions served as inputs to our framework. The instructor evaluated the generated feedback and labeled it for performance assessment. Our approach demonstrated promising results, outperforming chain-of-thought (CoT) prompting. Students also responded positively to the generated feedback, reporting high levels of perceived usefulness in a survey based on the Technology Acceptance Model.
摘要
大语言模型(LLMs)在自动化知识密集型流程中的决策逻辑方面展现出显著潜力,但其效能很大程度上依赖于提示策略与质量。由于决策逻辑通常嵌入于提示中,终端用户难以对其进行修改或优化。决策模型与标记法(DMN)提供了一种标准化的图形化方法,能以结构化且用户友好的方式定义决策逻辑。本文提出一种DMN引导的提示框架,将复杂决策逻辑分解为更小、更易管理的组件,通过结构化决策路径引导LLMs。我们在研究生课程中实施该框架,学生提交作业后,作业内容和代表反馈指令的DMN模型作为框架输入。授课教师对生成的反馈进行评估并标注性能指标。该方法表现出优于思维链(CoT)提示的效果,且学生基于技术接受模型的调查反馈显示,他们对生成反馈的感知有用性评价较高。
LLM Agents Are Hypersensitive to Nudges
Abstract
arXiv:2505.11584v1 Announce Type: new Abstract: LLMs are being set loose in complex, real-world environments involving sequential decision-making and tool use. Often, this involves making choices on behalf of human users. However, not much is known about the distribution of such choices, and how susceptible they are to different choice architectures. We perform a case study with a few such LLM models on a multi-attribute tabular decision-making problem, under canonical nudges such as the default option, suggestions, and information highlighting, as well as additional prompting strategies. We show that, despite superficial similarities to human choice distributions, such models differ in subtle but important ways. First, they show much higher susceptibility to the nudges. Second, they diverge in points earned, being affected by factors like the idiosyncrasy of available prizes. Third, they diverge in information acquisition strategies: e.g. incurring substantial cost to reveal too much information, or selecting without revealing any. Moreover, we show that simple prompt strategies like zero-shot chain of thought (CoT) can shift the choice distribution, and few-shot prompting with human data can induce greater alignment. Yet, none of these methods resolve the sensitivity of these models to nudges. Finally, we show how optimal nudges optimized with a human resource-rational model can similarly increase LLM performance for some models. All these findings suggest that behavioral tests are needed before deploying models as agents or assistants acting on behalf of users in complex environments.
摘要
大型语言模型(LLMs)正被部署于涉及序列决策和工具使用的复杂现实环境中。这类场景通常需要模型代表人类用户做出选择。然而,目前对此类选择的分布特征及其对不同选择架构的敏感性仍缺乏深入研究。我们针对若干LLM模型开展案例研究,通过多属性表格决策任务,考察默认选项、建议提示、信息突显等经典助推手段及额外提示策略的影响。研究发现,尽管这些模型的表面选择分布与人类存在相似性,但在细微而关键的维度上存在差异:首先,它们对助推手段表现出更高的敏感性;其次,在收益获取上存在偏离,易受奖品特异性等因素影响;第三,其信息获取策略显著不同,例如可能付出高昂成本获取过量信息,或在未揭示任何信息时直接选择。实验还表明,零样本思维链(CoT)等简单提示策略能改变选择分布,而基于人类数据的少样本提示可提升对齐性,但这些方法均未能消除模型对助推的敏感性。最后,我们证明基于人类资源理性模型优化的最佳助推策略同样能提升部分LLM的性能。这些发现共同表明,在将模型作为用户代理部署于复杂环境前,必须进行行为测试。
Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing
Abstract
arXiv:2505.11743v1 Announce Type: new Abstract: With the rapid development of cloud computing systems and the increasing complexity of their infrastructure, intelligent mechanisms to detect and mitigate failures in real time are becoming increasingly important. Traditional failure detection methods often struggle to cope with the scale and dynamics of modern cloud environments. In this study, we propose a novel AI framework based on large language models (LLMs) for intelligent fault detection and self-healing mechanisms in cloud systems. The model combines existing machine learning fault detection algorithms with an LLM's natural language understanding capabilities to process and parse system logs, error reports, and real-time data streams through semantic context. The method adopts a multi-level architecture, combined with supervised learning for fault classification and unsupervised learning for anomaly detection, so that the system can predict potential failures before they occur and automatically trigger the self-healing mechanism. Experimental results show that the proposed model is significantly better than the traditional fault detection system in terms of fault detection accuracy, system downtime reduction and recovery speed.
摘要
随着云计算系统的快速发展和基础设施日益复杂,实时检测与缓解故障的智能机制变得愈发重要。传统故障检测方法往往难以应对现代云环境的规模与动态性。本研究提出一种基于大语言模型(LLM)的新型人工智能框架,用于实现云系统智能故障检测与自愈机制。该模型将现有机器学习故障检测算法与LLM的自然语言理解能力相结合,通过语义上下文处理解析系统日志、错误报告和实时数据流。该方法采用多层架构,结合监督学习进行故障分类和无监督学习进行异常检测,使系统能够在潜在故障发生前进行预测并自动触发自愈机制。实验结果表明,所提模型在故障检测准确率、系统停机时间缩减及恢复速度方面显著优于传统故障检测系统。
Heart2Mind: Human-Centered Contestable Psychiatric Disorder Diagnosis System using Wearable ECG Monitors
Abstract
arXiv:2505.11612v1 Announce Type: new Abstract: Psychiatric disorders affect millions globally, yet their diagnosis faces significant challenges in clinical practice due to subjective assessments and accessibility concerns, leading to potential delays in treatment. To help address this issue, we present Heart2Mind, a human-centered contestable psychiatric disorder diagnosis system using wearable electrocardiogram (ECG) monitors. Our approach leverages cardiac biomarkers, particularly heart rate variability (HRV) and R-R intervals (RRI) time series, as objective indicators of autonomic dysfunction in psychiatric conditions. The system comprises three key components: (1) a Cardiac Monitoring Interface (CMI) for real-time data acquisition from Polar H9/H10 devices; (2) a Multi-Scale Temporal-Frequency Transformer (MSTFT) that processes RRI time series through integrated time-frequency domain analysis; (3) a Contestable Diagnosis Interface (CDI) combining Self-Adversarial Explanations (SAEs) with contestable Large Language Models (LLMs). Our MSTFT achieves 91.7% accuracy on the HRV-ACC dataset using leave-one-out cross-validation, outperforming state-of-the-art methods. SAEs successfully detect inconsistencies in model predictions by comparing attention-based and gradient-based explanations, while LLMs enable clinicians to validate correct predictions and contest erroneous ones. This work demonstrates the feasibility of combining wearable technology with Explainable Artificial Intelligence (XAI) and contestable LLMs to create a transparent, contestable system for psychiatric diagnosis that maintains clinical oversight while leveraging advanced AI capabilities. Our implementation is publicly available at: https://github.com/Analytics-Everywhere-Lab/heart2mind.
摘要
精神障碍影响着全球数百万人,但由于临床实践中主观评估和可及性问题,其诊断面临重大挑战,可能导致治疗延迟。为应对这一问题,我们提出Heart2Mind——一种基于可穿戴心电图(ECG)监测设备、以人为中心且具有可争议性的精神障碍诊断系统。该方法利用心脏生物标志物(特别是心率变异性HRV和R-R间期RRI时间序列)作为精神疾病自主神经功能障碍的客观指标。系统包含三个核心组件:(1) 用于从Polar H9/H10设备实时获取数据的心脏监测接口(CMI);(2) 通过时频域联合分析处理RRI时间序列的多尺度时频变换器(MSTFT);(3) 将自对抗解释(SAEs)与可争议大语言模型(LLMs)相结合的争议诊断接口(CDI)。我们的MSTFT在HRV-ACC数据集上采用留一法交叉验证达到91.7%准确率,优于现有最优方法。SAEs通过比较基于注意力和梯度的解释成功检测模型预测不一致性,而LLMs使临床医生能验证正确预测并质疑错误结果。这项工作证明了将可穿戴技术与可解释人工智能(XAI)及可争议LLMs相结合,构建透明、可争议精神障碍诊断系统的可行性,该系统在利用先进AI能力的同时保持临床监督。实现代码已公开于:https://github.com/Analytics-Everywhere-Lab/heart2mind。
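For readers unfamiliar with the cardiac signals named in the Heart2Mind abstract, the following minimal sketch computes two standard time-domain HRV statistics, SDNN and RMSSD, from a series of R-R intervals. It is only an illustration of what HRV features derived from RRI data look like; the abstract does not say that the MSTFT model uses these particular statistics, and the function names and sample values are hypothetical.

```python
import math
import statistics
from typing import Sequence


def sdnn(rr_ms: Sequence[float]) -> float:
    """Sample standard deviation of the R-R intervals, in milliseconds."""
    return statistics.stdev(rr_ms)


def rmssd(rr_ms: Sequence[float]) -> float:
    """Root mean square of successive R-R interval differences, in milliseconds."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(statistics.fmean([d * d for d in diffs]))


# Hypothetical R-R intervals in milliseconds (not taken from the paper's dataset).
rr = [800.0, 810.0, 790.0, 805.0]
print(round(sdnn(rr), 2), round(rmssd(rr), 2))
```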
OMAC: A Broad Optimization Framework for LLM-Based Multi-Agent Collaboration
Abstract
arXiv:2505.11765v1 Announce Type: new Abstract: Agents powered by advanced large language models (LLMs) have demonstrated impressive capabilities across diverse complex applications. Recently, Multi-Agent Systems (MAS), wherein multiple agents collaborate and communicate with each other, have exhibited enhanced capabilities in complex tasks, such as high-quality code generation and arithmetic reasoning. However, the development of such systems often relies on handcrafted methods, and the literature on systematic design and optimization of LLM-based MAS remains limited. In this work, we introduce OMAC, a general framework designed for holistic optimization of LLM-based MAS. Specifically, we identify five key optimization dimensions for MAS, encompassing both agent functionality and collaboration structure. Building upon these dimensions, we first propose a general algorithm, utilizing two actors termed the Semantic Initializer and the Contrastive Comparator, to optimize any single dimension. Then, we present an algorithm for joint optimization across multiple dimensions. Extensive experiments demonstrate the superior performance of OMAC on code generation, arithmetic reasoning, and general reasoning tasks against state-of-the-art approaches.
摘要
基于先进大语言模型(LLM)的智能体已在多样化的复杂应用中展现出卓越能力。近期,多智能体系统(MAS)通过智能体间的协作与通信,在代码生成和算术推理等复杂任务中表现出增强性能。然而,此类系统的开发通常依赖手工方法,关于基于LLM的MAS系统化设计与优化的研究仍较为有限。本研究提出OMAC框架,旨在实现基于LLM的MAS整体优化。具体而言,我们识别出MAS的五个关键优化维度,涵盖智能体功能与协作结构。基于这些维度,首先提出通用算法——利用语义初始化器和对比比较器两个执行组件——以优化单一维度;继而提出跨维度联合优化算法。大量实验表明,OMAC在代码生成、算术推理和通用推理任务上的性能显著优于现有最优方法。
REMOR: Automated Peer Review Generation with LLM Reasoning and Multi-Objective Reinforcement Learning
Abstract
arXiv:2505.11718v1 Announce Type: new Abstract: AI-based peer review systems tend to produce shallow and overpraising suggestions compared to human feedback. Here, we evaluate how well a reasoning LLM trained with multi-objective reinforcement learning (REMOR) can overcome these limitations. We start by designing a multi-aspect reward function that aligns with human evaluation of reviews. The aspects are related to the review itself (e.g., criticisms, novelty) and the relationship between the review and the manuscript (i.e., relevance). First, we perform supervised fine-tuning of DeepSeek-R1-Distill-Qwen-7B using LoRA on PeerRT, a new dataset of high-quality top AI conference reviews enriched with reasoning traces. We then apply Group Relative Policy Optimization (GRPO) to train two models: REMOR-H (with the human-aligned reward) and REMOR-U (with a uniform reward). Interestingly, the human-aligned reward penalizes aspects typically associated with strong reviews, leading REMOR-U to produce qualitatively more substantive feedback. Our results show that REMOR-U and REMOR-H achieve more than twice the average rewards of human reviews, non-reasoning state-of-the-art agentic multi-modal AI review systems, and general commercial LLM baselines. We found that while the best AI and human reviews are comparable in quality, REMOR avoids the long tail of low-quality human reviews. We discuss how reasoning is key to achieving these improvements and release the Human-aligned Peer Review Reward (HPRR) function, the Peer Review Reasoning-enriched Traces (PeerRT) dataset, and the REMOR models, which we believe can help spur progress in the area.
摘要
基于AI的同行评审系统往往会产生比人类反馈更肤浅且过度褒扬的建议。本研究评估了采用多目标强化学习(REMOR)训练的逻辑推理大语言模型如何克服这些局限。我们首先设计了一个与人类评审评价标准一致的多维度奖励函数,这些维度涉及评审本身特性(如批评性、新颖性)以及评审与稿件间关联性(即相关性)。研究首先在PeerRT数据集上使用LoRA方法对DeepSeek-R1-Distill-Qwen-7B模型进行监督微调,该数据集是富含推理痕迹的顶级AI会议高质量评审新数据集。随后应用组相对策略优化(GRPO)训练了两个模型:REMOR-H(采用人类对齐奖励)和REMOR-U(采用均匀奖励)。有趣的是,人类对齐奖励会惩罚通常与优质评审相关的维度,这使得REMOR-U能产生质量上更具实质性的反馈。结果表明,REMOR-U和REMOR-H获得的平均奖励超过人类评审、非推理型最先进多模态AI评审系统及通用商业大语言模型基线两倍以上。研究发现,虽然最佳AI评审与人类评审质量相当,但REMOR避免了人类评审中常见的低质量长尾现象。我们论证了逻辑推理是实现这些改进的关键,并开源了人类对齐同行评审奖励函数(HPRR)、富含推理的同行评审痕迹数据集(PeerRT)及REMOR模型,这些资源有望推动该领域发展。
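The REMOR abstract combines a multi-aspect reward with Group Relative Policy Optimization. A minimal sketch of those two pieces follows, assuming a weighted sum over aspect scores and the usual group-normalized advantage (each sampled review's reward minus the group mean, divided by the group standard deviation). The aspect names, scores, and weights are hypothetical placeholders, not the released HPRR function.

```python
import statistics
from typing import Dict, List


def combine_reward(aspects: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-aspect review scores (aspect names are illustrative)."""
    return sum(weights[name] * aspects[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical aspect scores for three reviews sampled for the same manuscript.
uniform = {"criticism": 1.0, "novelty": 1.0, "relevance": 1.0}  # REMOR-U-style uniform weights
group = [
    {"criticism": 0.2, "novelty": 0.1, "relevance": 0.7},
    {"criticism": 0.6, "novelty": 0.4, "relevance": 1.0},
    {"criticism": 0.9, "novelty": 0.8, "relevance": 1.3},
]
rewards = [combine_reward(a, uniform) for a in group]
print(group_relative_advantages(rewards))
```

A human-aligned variant would differ only in the `weights` dictionary, which is where the abstract's observation about penalized aspects would show up.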
Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges
Abstract
arXiv:2505.11618v1 Announce Type: new Abstract: Spatiotemporal reasoning plays a key role in Cyber-Physical Systems (CPS). Despite advances in Large Language Models (LLMs) and Large Reasoning Models (LRMs), their capacity to reason about complex spatiotemporal signals remains underexplored. This paper proposes a hierarchical SpatioTemporal reAsoning benchmaRK, STARK, to systematically evaluate LLMs across three levels of reasoning complexity: state estimation (e.g., predicting field variables, localizing and tracking events in space and time), spatiotemporal reasoning over states (e.g., inferring spatial-temporal relationships), and world-knowledge-aware reasoning that integrates contextual and domain knowledge (e.g., intent prediction, landmark-aware navigation). We curate 26 distinct spatiotemporal tasks with diverse sensor modalities, comprising 14,552 challenges where models answer directly or via a Python code interpreter. Evaluating 3 LRMs and 8 LLMs, we find LLMs achieve limited success in tasks requiring geometric reasoning (e.g., multilateration or triangulation), particularly as complexity increases. Surprisingly, LRMs show robust performance across tasks with various levels of difficulty, often competing with or surpassing traditional first-principle-based methods. Our results show that in reasoning tasks requiring world knowledge, the performance gap between LLMs and LRMs narrows, with some LLMs even surpassing LRMs. However, the LRM o3 model continues to achieve leading performance across all evaluated tasks, a result attributed primarily to the larger size of the reasoning models. STARK motivates future innovations in model architectures and reasoning paradigms for intelligent CPS by providing a structured framework to identify limitations in the spatiotemporal reasoning of LLMs and LRMs.
摘要
时空推理在信息物理系统(CPS)中具有关键作用。尽管大语言模型(LLM)和大推理模型(LRM)取得了进展,但其对复杂时空信号的推理能力仍待深入探索。本文提出分层时空推理基准STARK,系统评估LLM在三个推理复杂度层级的表现:状态估计(如预测场变量、时空事件定位与追踪)、基于状态的时空推理(如推断时空关系)以及融合上下文与领域知识的世界知识感知推理(如意图预测、地标感知导航)。我们构建了涵盖26种传感器模态的时空任务,包含14,552个挑战项,模型可通过直接回答或Python代码解释器完成。通过评估3个LRM和8个LLM,发现LLM在需要几何推理的任务(如多边测量或三角定位)中成功率有限,且随复杂度增加表现显著下降。值得注意的是,LRM在不同难度任务中均表现出鲁棒性,常优于或媲美传统基于第一性原理的方法。研究表明,在需要世界知识的推理任务中,LLM与LRM的性能差距缩小,部分LLM甚至超越LRM。但LRM o3模型在所有评估任务中持续保持领先优势,这主要归因于其更大的模型规模。STARK通过结构化框架揭示了LLM和LRM在时空推理中的局限性,为智能CPS的模型架构与推理范式创新提供了方向。
Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission
Abstract
arXiv:2505.11788v1 Announce Type: new Abstract: To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between SLM's uncertainty and LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206x higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.
摘要
为支持基于语言的应用程序利用分散异构计算资源,混合语言模型(HLM)提供了一种前景广阔的架构:设备端的小型语言模型(SLM)生成候选标记,由远程大型语言模型(LLM)进行验证和校正。然而原始HLM存在显著通信开销,因为LLM要求SLM为每个标记上传完整的词汇表概率分布。此外,当LLM验证极可能被接受的标记时,通信与计算资源均被浪费。为克服这些局限,我们提出通信高效且不确定性感知的HLM(CU-HLM)。在CU-HLM中,SLM仅在其输出不确定性较高时传输截断的词汇表分布。通过发现SLM不确定性与LLM拒绝概率间的强相关性,我们验证了这种机会式传输的可行性。进一步,我们理论推导出最优不确定性阈值与最优词汇表截断策略。仿真结果表明:相较于标准HLM,CU-HLM通过跳过74.8%的传输并实现97.4%的词汇表压缩,使标记吞吐量提升最高达206倍,同时保持97.4%的准确率。
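The opportunistic, compressed transmission described in the CU-HLM abstract can be sketched as a simple gate on the device side. This is an illustration under stated assumptions rather than the paper's method: the uncertainty measure (one minus the top token probability), the fixed `threshold`, the `top_k` truncation, and the renormalization step are all hypothetical choices, whereas the paper derives its thresholds and truncation strategy theoretically.

```python
from typing import Dict, Optional, Sequence


def maybe_transmit(
    probs: Sequence[float], threshold: float, top_k: int
) -> Optional[Dict[int, float]]:
    """Return a truncated distribution to upload, or None to skip the upload.

    probs is the small model's next-token distribution. The uncertainty proxy
    and renormalization used here are assumptions for illustration only.
    """
    uncertainty = 1.0 - max(probs)
    if uncertainty <= threshold:
        return None  # confident draft token: skip transmission entirely

    # Keep only the top_k most likely token ids and renormalize their mass.
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in top_ids)
    return {i: probs[i] / mass for i in top_ids}


print(maybe_transmit([0.97, 0.01, 0.01, 0.01], threshold=0.1, top_k=2))
print(maybe_transmit([0.4, 0.3, 0.2, 0.1], threshold=0.1, top_k=2))
```

The first call skips the upload because the draft token is confident; the second uploads only two of the four vocabulary entries.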
ChatHTN: Interleaving Approximate (LLM) and Symbolic HTN Planning
Abstract
arXiv:2505.11814v1 Announce Type: new Abstract: We introduce ChatHTN, a Hierarchical Task Network (HTN) planner that combines symbolic HTN planning techniques with queries to ChatGPT to approximate solutions in the form of task decompositions. The resulting hierarchies interleave task decompositions generated by symbolic HTN planning with those generated by ChatGPT. Despite the approximate nature of the results generated by ChatGPT, ChatHTN is provably sound; any plan it generates correctly achieves the input tasks. We demonstrate this property with an open-source implementation of our system.
摘要
我们提出ChatHTN——一种结合符号化分层任务网络(HTN)规划技术与ChatGPT查询的分层任务网络规划器,其通过任务分解形式生成近似解。该体系结构交替整合符号化HTN规划生成的任务分解与ChatGPT产生的分解方案。尽管ChatGPT生成的结果具有近似性,但ChatHTN具有可证明的可靠性:其生成的任何计划都能正确完成输入任务。我们通过系统的开源实现验证了这一特性。
On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study
Abstract
arXiv:2505.11839v1 Announce Type: new Abstract: Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models (LLMs). By generating and analyzing counterfactual scenarios, researchers can assess the adaptability and reliability of model decision-making. Although prior work has shown that LLMs often struggle with counterfactual reasoning, it remains unclear which factors most significantly impede their performance across different tasks and modalities. In this paper, we propose a decompositional strategy that breaks down the counterfactual generation from causality construction to the reasoning over counterfactual interventions. To support decompositional analysis, we investigate 11 datasets spanning diverse tasks, including natural language understanding, mathematics, programming, and vision-language tasks. Through extensive evaluations, we characterize LLM behavior across each decompositional stage and identify how modality type and intermediate reasoning influence performance. By establishing a structured framework for analyzing counterfactual reasoning, this work contributes to the development of more reliable LLM-based reasoning systems and informs future elicitation strategies.
摘要
反事实推理已成为增强大语言模型(LLMs)推理能力的关键技术。通过生成和分析反事实场景,研究者能够评估模型决策的适应性与可靠性。尽管已有研究表明LLMs在反事实推理中常表现不佳,但何种因素对不同任务和模态下的性能阻碍最大仍不明确。本文提出一种分解策略,将反事实生成过程从因果构建拆解至反事实干预的推理阶段。为支持分解分析,我们研究了涵盖自然语言理解、数学、编程及视觉语言任务等11个数据集。通过大规模评估,我们刻画了LLMs在各分解阶段的行为特征,并揭示了模态类型与中间推理如何影响性能。本研究通过建立反事实推理的结构化分析框架,为开发更可靠的基于LLM的推理系统提供了基础,同时为未来能力激发策略提供了理论依据。
Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling
Abstract
arXiv:2505.11792v1 Announce Type: new Abstract: Optimization modeling is fundamental to decision-making across diverse domains. Despite progress in automating optimization formulation from natural language descriptions, Large Language Models (LLMs) often struggle to generate formally correct and usable models due to hallucinations, posing a challenge for reliable automation. Inspired by the success of Reinforcement Learning (RL) in enhancing Large Reasoning Models, we present Solver-Informed Reinforcement Learning (SIRL). This novel framework leverages external optimization solvers as verifiable reward mechanisms to significantly improve the authenticity of LLMs for optimization modeling. Acting as precise verifiers, these solvers automatically assess the executable code and the instance-level mathematical model represented by the associated LP file, yielding precise and comprehensive feedback signals -- including syntax, feasibility, and solution quality -- that directly inform the RL process. This automated verification process, powered by classic optimization solvers, also underpins our instance-enhanced self-consistency method to synthesize high-quality training data. Extensive experiments on diverse public benchmarks demonstrate that SIRL achieves state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models.
摘要
优化建模是跨领域决策制定的基础。尽管从自然语言描述自动生成优化模型的研究已取得进展,但大型语言模型(LLMs)常因幻觉问题难以生成形式正确且可用的模型,这为可靠自动化带来了挑战。受强化学习(RL)在增强大型推理模型方面成功的启发,我们提出求解器知情强化学习(SIRL)框架。该创新方法利用外部优化求解器作为可验证的奖励机制,显著提升LLMs在优化建模中的真实性。这些求解器作为精确验证器,能自动评估可执行代码及关联LP文件所表示的实例级数学模型,产生精确全面的反馈信号——包括语法、可行性和解质量等直接指导RL过程的信息。这种由经典优化求解器驱动的自动化验证过程,还支撑了我们提出的实例增强自洽方法,用于合成高质量训练数据。在多样化公共基准测试上的大量实验表明,SIRL实现了最先进的性能,在生成准确且可执行的优化模型方面显著优于现有方法。
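The SIRL abstract names three solver-derived feedback signals: syntax, feasibility, and solution quality. One way such signals could be folded into a scalar reward is sketched below. Everything here is a hypothetical illustration: the `SolverReport` structure, the staged reward values (0.0, 0.25, 0.5, 1.0), and the relative-gap tolerance are assumptions, and no real solver is invoked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SolverReport:
    """Hypothetical summary of what a solver run reported for generated code."""
    code_ran: bool              # syntax signal: did the modeling code execute?
    feasible: bool              # feasibility signal: was a feasible solution found?
    objective: Optional[float]  # solution-quality signal: objective value, if any


def solver_informed_reward(
    report: SolverReport, reference_objective: float, rel_tol: float = 1e-6
) -> float:
    """Staged reward; the specific values are illustrative assumptions."""
    if not report.code_ran:
        return 0.0
    if not report.feasible or report.objective is None:
        return 0.25  # executable, but no feasible solution
    gap = abs(report.objective - reference_objective) / max(1.0, abs(reference_objective))
    return 1.0 if gap <= rel_tol else 0.5  # full credit only when the objective matches


print(solver_informed_reward(SolverReport(True, True, 10.0), reference_objective=10.0))
```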
ToLeaP: Rethinking Development of Tool Learning with Large Language Models
Abstract
arXiv:2505.11833v1 Announce Type: new Abstract: Tool learning, which enables large language models (LLMs) to utilize external tools effectively, has garnered increasing attention for its potential to revolutionize productivity across industries. Despite rapid development in tool learning, key challenges and opportunities remain understudied, limiting deeper insights and future advancements. In this paper, we investigate the tool learning ability of 41 prevalent LLMs by reproducing 33 benchmarks and enabling one-click evaluation for seven of them, forming a Tool Learning Platform named ToLeaP. We also collect 21 out of 33 potential training datasets to facilitate future exploration. After analyzing over 3,000 bad cases of 41 LLMs based on ToLeaP, we identify four main critical challenges: (1) benchmark limitations induce both the neglect and lack of (2) autonomous learning, (3) generalization, and (4) long-horizon task-solving capabilities of LLMs. To aid future advancements, we take a step further toward exploring potential directions, namely (1) real-world benchmark construction, (2) compatibility-aware autonomous learning, (3) rationale learning by thinking, and (4) identifying and recalling key clues. The preliminary experiments demonstrate their effectiveness, highlighting the need for further research and exploration.
摘要
工具学习通过使大语言模型(LLMs)能够有效利用外部工具,因其在各行业革新生产力的潜力而受到越来越多的关注。尽管工具学习发展迅速,但关键挑战与机遇仍未得到充分研究,这限制了对该领域更深层次见解和未来进展的探索。本文通过复现33个基准测试并对其中7个实现一键评估(构建名为ToLeaP的工具学习平台),调查了41个主流LLMs的工具学习能力。我们还收集了33个潜在训练数据集中的21个,以促进未来研究。基于ToLeaP平台分析41个LLMs的3000余个失败案例后,我们识别出四大核心挑战:(1) 基准测试的局限性导致LLMs在(2)自主学习、(3)泛化能力及(4)长程任务解决能力方面存在缺失与不足。为推进未来发展,我们进一步探索了四个潜在方向:(1) 真实场景基准构建、(2) 兼容性感知的自主学习、(3) 通过思维推演进行原理学习、(4) 关键线索识别与召回。初步实验验证了这些方向的有效性,凸显了进一步研究与探索的必要性。
Fair-PP: A Synthetic Dataset for Aligning LLM with Personalized Preferences of Social Equity
Abstract
arXiv:2505.11861v1 Announce Type: new Abstract: Human preference plays a crucial role in the refinement of large language models (LLMs). However, collecting human preference feedback is costly and most existing datasets neglect the correlation between personalization and preferences. To address this issue, we introduce Fair-PP, a synthetic dataset of personalized preferences targeting social equity, derived from real-world social survey data, which includes 28 social groups, 98 equity topics, and 5 personal preference dimensions. Leveraging GPT-4o-mini, we engage in role-playing based on seven representative persona portrayals guided by existing social survey data, yielding a total of 238,623 preference records. Through Fair-PP, we also contribute (i) an automated framework for generating preference data, along with a more fine-grained dataset of personalized preferences; (ii) analysis of the positioning of the existing mainstream LLMs across five major global regions within the personalized preference space; and (iii) a sample reweighting method for personalized preference alignment, enabling alignment with a target persona while maximizing the divergence from other personas. Empirical experiments show our method outperforms the baselines.
摘要
人类偏好在大型语言模型(LLM)的优化过程中起着关键作用。然而,收集人类偏好反馈成本高昂,且现有数据集大多忽视了个性化与偏好之间的关联性。为解决这一问题,我们提出了Fair-PP——一个针对社会公平的合成个性化偏好数据集,该数据集源自真实世界的社会调查数据,涵盖28个社会群体、98个公平议题和5个个人偏好维度。基于GPT-4o-mini,我们根据现有社会调查数据指导的七种代表性人物画像进行角色扮演,最终生成238,623条偏好记录。通过Fair-PP,我们还贡献了:(i)一个自动化偏好数据生成框架,以及更细粒度的个性化偏好数据集;(ii)对现有主流LLM在五大全球区域个性化偏好空间中的定位分析;(iii)一种面向个性化偏好对齐的样本重加权方法,可实现与目标人物对齐的同时最大化与其他人物画像的差异性。实证实验表明,我们的方法优于基线模型。
VeriReason: Reinforcement Learning with Testbench Feedback for Reasoning-Enhanced Verilog Generation
Abstract
arXiv:2505.11849v1 Announce Type: new Abstract: Automating Register Transfer Level (RTL) code generation using Large Language Models (LLMs) offers substantial promise for streamlining digital circuit design and reducing human effort. However, current LLM-based approaches face significant challenges with training data scarcity, poor specification-code alignment, lack of verification mechanisms, and balancing generalization with specialization. Inspired by DeepSeek-R1, we introduce VeriReason, a framework integrating supervised fine-tuning with Guided Reward Proximal Optimization (GRPO) reinforcement learning for RTL generation. Using curated training examples and a feedback-driven reward model, VeriReason combines testbench evaluations with structural heuristics while embedding self-checking capabilities for autonomous error correction. On the VerilogEval Benchmark, VeriReason delivers significant improvements, achieving 83.1% functional correctness on the VerilogEval Machine benchmark and substantially outperforming both comparable-sized models and much larger commercial systems like GPT-4 Turbo. Additionally, our approach demonstrates up to a 2.8X increase in first-attempt functional correctness compared to baseline methods and exhibits robust generalization to unseen designs. To our knowledge, VeriReason represents the first system to successfully integrate explicit reasoning capabilities with reinforcement learning for Verilog generation, establishing a new state-of-the-art for automated RTL synthesis. The models and datasets are available at https://huggingface.co/collections/AI4EDA-CASE and the code is available at https://github.com/NellyW8/VeriReason
摘要
利用大语言模型(LLMs)自动化寄存器传输级(RTL)代码生成为简化数字电路设计、降低人力成本提供了巨大潜力。然而,当前基于LLM的方法面临训练数据稀缺、规范与代码对齐不佳、缺乏验证机制以及泛化与专业化平衡等重大挑战。受DeepSeek-R1启发,我们提出VeriReason框架,该框架将监督微调与引导奖励近端优化(GRPO)强化学习相结合,用于RTL生成。通过精选训练样本和反馈驱动的奖励模型,VeriReason将测试平台评估与结构启发式方法相结合,同时嵌入自检能力以实现自主纠错。在VerilogEval基准测试中,VeriReason表现出显著提升:在VerilogEval Machine基准上实现83.1%的功能正确率,大幅优于同规模模型及GPT-4 Turbo等大型商业系统。此外,相较于基线方法,我们的方案首次尝试功能正确率最高提升2.8倍,并对未见设计展现出强大泛化能力。据我们所知,VeriReason是首个成功将显式推理能力与强化学习结合用于Verilog生成的系统,为自动化RTL合成确立了新标杆。
MLLM-based Discovery of Intrinsic Coordinates and Governing Equations from High-Dimensional Data
Abstract
arXiv:2505.11940v1 Announce Type: new Abstract: Discovering governing equations from scientific data is crucial for understanding the evolution of systems, and is typically framed as a search problem within a candidate equation space. However, the high-dimensional nature of dynamical systems leads to an exponentially expanding equation space, making the search process extremely challenging. The visual perception and pre-trained scientific knowledge of multimodal large language models (MLLM) hold promise for providing effective navigation in high-dimensional equation spaces. In this paper, we propose a zero-shot method based on MLLM for automatically discovering physical coordinates and governing equations from high-dimensional data. Specifically, we design a series of enhanced visual prompts for MLLM to enhance its spatial perception. In addition, MLLM's domain knowledge is employed to navigate the search process within the equation space. Quantitative and qualitative evaluations on two representative types of systems demonstrate that the proposed method effectively discovers the physical coordinates and equations from both simulated and real experimental data, with long-term extrapolation accuracy improved by approximately 26.96% compared to the baseline.
摘要
从科学数据中发现支配方程对于理解系统演化至关重要,通常被构建为候选方程空间中的搜索问题。然而,动力系统的高维特性导致方程空间呈指数级扩张,使得搜索过程极具挑战性。多模态大语言模型(MLLM)的视觉感知与预训练科学知识有望为高维方程空间提供有效导航。本文提出一种基于MLLM的零样本方法,用于从高维数据中自动发现物理坐标与支配方程。具体而言,我们设计了一系列增强型视觉提示以提升MLLM的空间感知能力,并利用其领域知识引导方程空间内的搜索过程。通过对两类典型系统的定量与定性评估表明,该方法能有效从仿真和真实实验数据中发现物理坐标与方程,其长期外推精度较基线方法提升约26.96%。
LLM-Enhanced Feature Engineering for Multi-Factor Electricity Price Predictions
Abstract
arXiv:2505.11890v1 Announce Type: new Abstract: Accurately forecasting electricity price volatility is crucial for effective risk management and decision-making. Traditional forecasting models often fall short in capturing the complex, non-linear dynamics of electricity markets, particularly when external factors like weather conditions and market volatility are involved. These limitations hinder their ability to provide reliable predictions in markets with high volatility, such as the New South Wales (NSW) electricity market. To address these challenges, we introduce FAEP, a Feature-Augmented Electricity Price Prediction framework. FAEP leverages Large Language Models (LLMs) combined with advanced feature engineering to enhance prediction accuracy. By incorporating external features such as weather data and price volatility jumps, and utilizing Retrieval-Augmented Generation (RAG) for effective feature extraction, FAEP overcomes the shortcomings of traditional approaches. A hybrid XGBoost-LSTM model in FAEP further refines these augmented features, resulting in a more robust prediction framework. Experimental results demonstrate that FAEP achieves state-of-the-art (SOTA) performance compared to other electricity price prediction models in the Australian New South Wales electricity market, showcasing the efficiency of LLM-enhanced feature engineering and hybrid machine learning architectures.
摘要
准确预测电价波动对于有效的风险管理和决策制定至关重要。传统预测模型往往难以捕捉电力市场中复杂的非线性动态特性,特别是在涉及天气条件和市场波动等外部因素时。这些局限性导致其无法在澳大利亚新南威尔士州(NSW)等高波动性电力市场提供可靠预测。为解决这些问题,我们提出了FAEP框架——一种基于特征增强的电价预测方法。该框架通过结合大型语言模型(LLMs)与先进特征工程技术来提升预测精度,具体包括整合天气数据和价格波动跳跃等外部特征,并采用检索增强生成(RAG)技术实现高效特征提取。FAEP中的XGBoost-LSTM混合模型进一步优化了这些增强特征,从而构建出更稳健的预测框架。实验结果表明,在澳大利亚新南威尔士电力市场中,FAEP相比其他电价预测模型实现了最先进(SOTA)的性能,充分证明了LLM增强的特征工程与混合机器学习架构的有效性。
Evaluating the Logical Reasoning Abilities of Large Reasoning Models
Abstract
arXiv:2505.11854v1 Announce Type: new Abstract: Large reasoning models, often post-trained on long chain-of-thought (long CoT) data with reinforcement learning, achieve state-of-the-art performance on mathematical, coding, and domain-specific reasoning benchmarks. However, their logical reasoning capabilities - fundamental to human cognition and independent of domain knowledge - remain understudied. To address this gap, we introduce LogiEval, a holistic benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis), sourced from high-quality human examinations (e.g., LSAT, GMAT). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance, yet exhibit uneven capabilities across reasoning types and formats, highlighting limitations in their generalization. Our analysis reveals that human performance does not mirror model failure distributions. To foster further research, we curate LogiEval-Hard, a challenging subset identified through a novel screening paradigm where small-model failures (Qwen3-30B-A3B) reliably predict difficulties for larger models. Modern models show striking, consistent failures on LogiEval-Hard. This demonstrates that fundamental reasoning bottlenecks persist across model scales, and establishes LogiEval-Hard as both a diagnostic tool and a rigorous testbed for advancing logical reasoning in LLMs.
摘要
大规模推理模型通常通过长链思维(long CoT)数据的强化学习后训练,在数学、编程和特定领域推理基准测试中达到最先进性能。然而,其逻辑推理能力——作为人类认知基础且独立于领域知识的核心特质——仍未得到充分研究。为填补这一空白,我们提出LogiEval,一个用于评估大规模推理模型逻辑推理能力的综合基准。LogiEval涵盖演绎、归纳、类比和溯因等多元推理类型,以及逻辑序列、论点分析等多种任务形式,数据源自LSAT、GMAT等高质量人类考试。实验表明,现代推理模型在四选一论点分析问题和类比推理上表现优异甚至超越人类,但在不同推理类型和任务形式间存在能力不均,凸显其泛化局限。分析揭示人类表现与模型失败分布并不一致。为推动研究,我们通过新型筛选范式构建LogiEval-Hard挑战子集:小模型(Qwen3-30B-A3B)的失败可稳定预测大模型面临的困难。现代模型在LogiEval-Hard上表现出显著且一致的失败模式,证实基础推理瓶颈在不同规模模型中持续存在,同时确立该子集作为诊断工具和推进大语言模型逻辑推理研究的严格测试平台。
LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners
Abstract
arXiv:2505.11942v1 Announce Type: new Abstract: Lifelong learning is essential for intelligent agents operating in dynamic environments. Current large language model (LLM)-based agents, however, remain stateless and unable to accumulate or transfer knowledge over time. Existing benchmarks treat agents as static systems and fail to evaluate lifelong learning capabilities. We present LifelongAgentBench, the first unified benchmark designed to systematically assess the lifelong learning ability of LLM agents. It provides skill-grounded, interdependent tasks across three interactive environments, Database, Operating System, and Knowledge Graph, with automatic label verification, reproducibility, and modular extensibility. Extensive experiments reveal that conventional experience replay has limited effectiveness for LLM agents due to irrelevant information and context length constraints. We further introduce a group self-consistency mechanism that significantly improves lifelong learning performance. We hope LifelongAgentBench will advance the development of adaptive, memory-capable LLM agents.
摘要
终身学习对于在动态环境中运行的智能体至关重要。然而当前基于大语言模型(LLM)的智能体仍处于无状态模式,无法随时间积累或迁移知识。现有基准测试将智能体视为静态系统,未能评估其终身学习能力。我们提出LifelongAgentBench——首个用于系统评估LLM智能体终身学习能力的统一基准,该基准在数据库、操作系统和知识图谱三个交互环境中提供技能导向的相互依存任务,具备自动标签验证、可复现性和模块化可扩展性。大量实验表明,由于无关信息和上下文长度限制,传统经验回放方法对LLM智能体效果有限。我们进一步提出群体自洽机制,可显著提升终身学习性能。期望LifelongAgentBench能推动具备记忆能力的自适应LLM智能体发展。
Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture
Abstract
arXiv:2505.11916v1 Announce Type: new Abstract: Existing large language model (LLM) serving systems typically employ a Prefill-Decode disaggregated architecture to prevent computational interference between the prefill and decode phases. However, real-world LLM serving scenarios often exhibit significant fluctuations in request input/output lengths, so a traditional static prefill/decode node configuration ratio produces imbalanced computational loads between the two node types and prevents efficient use of computing resources to improve the system's goodput. To address this challenge, we design and implement Arrow, an adaptive scheduler that leverages stateless instances and elastic instance pools to achieve efficient adaptive request and instance scheduling. Arrow dynamically adjusts the number of instances handling prefill and decode tasks based on real-time cluster performance metrics, significantly enhancing the system's capability to handle traffic spikes and load variations. Our evaluation under diverse real-world workloads shows that Arrow achieves up to 5.62× and 7.78× higher request serving rates compared to state-of-the-art PD-colocated and PD-disaggregated serving systems, respectively.
摘要
现有大型语言模型(LLM)服务系统通常采用预填充-解码分离架构以避免两阶段间的计算干扰。然而实际应用中,请求输入/输出长度常呈现显著波动,导致传统静态节点配比引发计算负载失衡,从而阻碍计算资源的高效利用与系统吞吐提升。为此,我们设计并实现了自适应调度系统Arrow,通过无状态实例与弹性实例池实现高效的请求-实例动态调度。该系统基于实时集群性能指标动态调整预填充与解码任务实例数量,显著增强了系统应对流量峰值与负载波动的能力。多样化实际工作负载测试表明,相比当前最优的共置部署与分离部署系统,Arrow分别实现了5.62倍与7.78倍的请求处理速率提升。
LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation
Abstract
arXiv:2505.12031v1 Announce Type: new Abstract: Recent advancements in large language models (LLMs) have sparked considerable interest in automated theorem proving, and a prominent line of research integrates stepwise LLM-based provers into tree search. In this paper, we introduce a novel proof-state exploration approach for training data synthesis, designed to produce diverse tactics across a wide range of intermediate proof states, thereby facilitating effective one-shot fine-tuning of an LLM as the policy model. We also propose an adaptive beam size strategy, which effectively takes advantage of our data synthesis method and achieves a trade-off between exploration and exploitation during tree search. Evaluations on the MiniF2F and ProofNet benchmarks demonstrate that our method outperforms strong baselines under the stringent Pass@1 metric, attaining an average pass rate of 60.74% on MiniF2F and 21.18% on ProofNet. These results underscore the impact of large-scale synthetic data in advancing automated theorem proving.
摘要
大语言模型(LLMs)的最新进展引发了人们对自动定理证明的广泛兴趣,当前主流研究将基于LLM的逐步证明器集成到树搜索中。本文提出了一种新颖的证明状态探索方法用于训练数据合成,该方法旨在生成覆盖广泛中间证明状态的多样化策略,从而实现对LLM作为策略模型的有效单次微调。我们还提出了一种自适应束宽策略,该策略充分利用我们的数据合成方法,在树搜索过程中实现探索与利用的平衡。在MiniF2F和ProofNet基准测试上的评估表明,我们的方法在严格的Pass@1指标下优于强基线模型,在MiniF2F上达到60.74%的平均通过率,在ProofNet上达到21.18%。这些结果凸显了大规模合成数据对推进自动定理证明领域的重要作用。
SOCIA: An End-to-End Agentic Framework for Automated Cyber-Physical-Social Simulator Generation
Abstract
arXiv:2505.12006v1 Announce Type: new Abstract: This paper introduces SOCIA (Simulation Orchestration for Cyber-physical-social Intelligence and Agents), a novel end-to-end framework leveraging Large Language Model (LLM)-based multi-agent systems to automate the generation of high-fidelity Cyber-Physical-Social (CPS) simulators. Addressing the challenges of labor-intensive manual simulator development and complex data calibration, SOCIA integrates a centralized orchestration manager that coordinates specialized agents for tasks including data comprehension, code generation, simulation execution, and iterative evaluation-feedback loops. Through empirical evaluations across diverse CPS tasks, such as mask adoption behavior simulation (social), personal mobility generation (physical), and user modeling (cyber), SOCIA demonstrates its ability to produce high-fidelity, scalable simulations with reduced human intervention. These results highlight SOCIA's potential to offer a scalable solution for studying complex CPS phenomena.
摘要
本文介绍了一种新型端到端框架SOCIA(面向信息物理社会智能体与系统的仿真编排系统),该框架基于大语言模型(LLM)的多智能体系统,实现了高保真信息物理社会(CPS)模拟器的自动化生成。针对人工开发模拟器劳动密集和数据校准复杂等挑战,SOCIA通过集成中央编排管理器,协调数据理解、代码生成、仿真执行和迭代评估-反馈循环等专项任务的智能体。通过在口罩佩戴行为模拟(社会层面)、个人移动轨迹生成(物理层面)和用户建模(信息层面)等多样化CPS任务中的实证评估表明,SOCIA能够以较少人工干预生成高保真、可扩展的仿真系统。这些结果凸显了SOCIA为复杂CPS现象研究提供可扩展解决方案的潜力。
Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier
Abstract
arXiv:2505.11966v1 Announce Type: new Abstract: Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent step of verification, while intended to improve performance, further complicates this landscape by introducing its own challenging trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test-time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying solution completion points to trigger targeted verification and provide focused solver feedback. Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench. Furthermore, on challenging mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full approach outperforms baselines like self-consistency in reasoning accuracy and inference efficiency. Our system offers a scalable and effective solution to enhance LLM reasoning at test time.
摘要
针对复杂任务的大语言模型(LLM)推理本质上需要在解决方案准确性与计算效率之间进行权衡。后续的验证步骤虽旨在提升性能,却因引入自身挑战性权衡而进一步复杂化:若在测试时简单整合生成式奖励模型(GenRM)与LLM,其高复杂度可能导致计算资源难以承受;而更简单快速的方法则可能缺乏可靠性。为克服这些挑战,我们提出FlexiVe——一种通过"验证预算弹性分配"策略灵活平衡快速可靠直觉思维与缜密审慎思维的新型生成式验证器。我们进一步设计"求解-检测-验证"流水线,该高效推理时扩展框架智能集成FlexiVe,主动识别求解完成节点以触发定向验证并提供针对性求解反馈。实验表明FlexiVe在ProcessBench基准上能精准定位推理轨迹中的错误。此外,在具有挑战性的数学推理基准(AIME 2024、AIME 2025和CNMO)上,我们的完整方案在推理准确性和推理效率方面均优于自洽性等基线方法。本系统为增强测试时LLM推理能力提供了可扩展的有效解决方案。
Interactional Fairness in LLM Multi-Agent Systems: An Evaluation Framework
Abstract
arXiv:2505.12001v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly used in multi-agent systems, questions of fairness should extend beyond resource distribution and procedural design to include the fairness of how agents communicate. Drawing from organizational psychology, we introduce a novel framework for evaluating Interactional fairness encompassing Interpersonal fairness (IF) and Informational fairness (InfF) in LLM-based multi-agent systems (LLM-MAS). We extend the theoretical grounding of Interactional Fairness to non-sentient agents, reframing fairness as a socially interpretable signal rather than a subjective experience. We then adapt established tools from organizational justice research, including Colquitt's Organizational Justice Scale and the Critical Incident Technique, to measure fairness as a behavioral property of agent interaction. We validate our framework through a pilot study using controlled simulations of a resource negotiation task. We systematically manipulate tone, explanation quality, outcome inequality, and task framing (collaborative vs. competitive) to assess how IF influences agent behavior. Results show that tone and justification quality significantly affect acceptance decisions even when objective outcomes are held constant. In addition, the influence of IF vs. InfF varies with context. This work lays the foundation for fairness auditing and norm-sensitive alignment in LLM-MAS.
摘要
随着大型语言模型(LLMs)在多智能体系统中的日益广泛应用,公平性问题应从资源分配和程序设计延伸至智能体间交互的公平性评估。借鉴组织心理学理论,我们提出一个新颖的框架用于评估基于LLM的多智能体系统(LLM-MAS)中的交互公平性,该框架包含人际公平(IF)和信息公平(InfF)两个维度。我们将交互公平性的理论基础扩展至非感知智能体,将其重新定义为社会可解读的信号而非主观体验。随后,我们采用组织公正研究中的成熟工具——包括Colquitt组织公正量表和关键事件技术——将公平性量化为智能体交互的行为属性。通过资源协商任务的受控模拟实验,我们对该框架进行了初步验证:系统操纵语气、解释质量、结果不平等性及任务框架(协作型vs.竞争型)以评估IF对智能体行为的影响。结果显示,即使在客观结果恒定的情况下,语气和理由质量仍显著影响接受决策。此外,IF与InfF的相对影响力随情境变化而不同。本研究为LLM-MAS的公平性审计和规范敏感对齐奠定了基础。
Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents
Abstract
arXiv:2505.12065v1 Announce Type: new Abstract: Large Language Model (LLM)-based search agents have shown remarkable capabilities in solving complex tasks by dynamically decomposing problems and addressing them through interleaved reasoning and retrieval. However, this interleaved paradigm introduces substantial efficiency bottlenecks. First, we observe that both highly accurate and overly approximate retrieval methods degrade system efficiency: exact search incurs significant retrieval overhead, while coarse retrieval requires additional reasoning steps during generation. Second, we identify inefficiencies in system design, including improper scheduling and frequent retrieval stalls, which lead to cascading latency -- where even minor delays in retrieval amplify end-to-end inference time. To address these challenges, we introduce SearchAgent-X, a high-efficiency inference framework for LLM-based search agents. SearchAgent-X leverages high-recall approximate retrieval and incorporates two key techniques: priority-aware scheduling and non-stall retrieval. Extensive experiments demonstrate that SearchAgent-X consistently outperforms state-of-the-art systems such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4× higher throughput and 5× lower latency, without compromising generation quality. SearchAgent-X is available at https://github.com/tiannuo-yang/SearchAgent-X.
摘要
基于大语言模型(LLM)的搜索代理通过动态分解问题并交织推理与检索来解决复杂任务,展现出卓越能力。然而这种交织范式存在显著的效率瓶颈。首先,我们发现高精度检索与过度近似检索方法均会降低系统效率:精确搜索带来巨大检索开销,而粗略检索则需在生成过程中增加额外推理步骤。其次,系统设计存在低效问题,包括不当调度和频繁检索停滞,导致级联延迟——即使检索中的微小延迟也会放大端到端推理时间。针对这些挑战,我们提出SearchAgent-X,一个面向LLM搜索代理的高效推理框架。该框架采用高召回率近似检索,并整合两项关键技术:优先级感知调度和无停滞检索。大量实验表明,SearchAgent-X在多样化任务中持续优于vLLM和基于HNSW检索等先进系统,最高可实现3.4倍吞吐量提升和5倍延迟降低,且不损害生成质量。SearchAgent-X已在https://github.com/tiannuo-yang/SearchAgent-X开源。
Efficient RL Training for Reasoning Models via Length-Aware Optimization
Abstract
arXiv:2505.12284v1 Announce Type: new Abstract: Large reasoning models, such as OpenAI o1 or DeepSeek R1, have demonstrated remarkable performance on reasoning tasks but often incur a long reasoning path with significant memory and time costs. Existing methods primarily aim to shorten reasoning paths by introducing additional training data and stages. In this paper, we propose three critical reward designs integrated directly into the reinforcement learning process of large reasoning models, which reduce the response length without extra training stages. Experiments on four settings show that our method significantly decreases response length while maintaining or even improving performance. Specifically, in a logic reasoning setting, we achieve a 40% reduction in response length averaged by steps alongside a 14% gain in performance. For math problems, we reduce response length averaged by steps by 33% while preserving performance.
摘要
大型推理模型(如OpenAI o1或DeepSeek R1)在推理任务中展现出卓越性能,但通常伴随冗长的推理路径,导致显著的内存与时间开销。现有方法主要通过引入额外训练数据和阶段来缩短推理路径。本文提出三种关键奖励设计,将其直接集成至大型推理模型的强化学习过程中,从而无需额外训练阶段即可缩减响应长度。在四种实验场景中,我们的方法在保持甚至提升性能的同时显著降低了响应长度。具体而言,在逻辑推理场景中,我们实现了步骤平均响应长度减少40%,同时性能提升14%;对于数学问题,在保持性能不变的情况下,步骤平均响应长度减少33%。
Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation
Abstract
arXiv:2505.12058v1 Announce Type: new Abstract: Tiny QA Benchmark++ (TQB++) presents an ultra-lightweight, multilingual smoke-test suite designed to give large-language-model (LLM) pipelines a unit-test style safety net dataset that runs in seconds with minimal cost. TQB++ was born out of the tight feedback-loop demands of building the Comet Opik prompt-optimization SDK, where waiting on heavyweight benchmarks breaks developer flow. It couples a 52-item English gold set (less than 20 kB) with a tiny synthetic-data generator PyPI package built on provider-agnostic LiteLLM. The generator lets practitioners mint their own tiny packs in any language, domain, or difficulty, while ten ready-made packs already cover Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish. Every dataset ships with Croissant metadata and plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools, so teams can drop deterministic micro-benchmarks directly into pull-request gates, prompt-engineering loops, and production dashboards without touching GPU budgets. A complete TQB++ run adds only a few seconds to pipeline latency yet reliably flags prompt-template errors, tokenizer drift, and fine-tuning side-effects long before full-scale suites like MMLU or BIG-Bench would finish configuring. The entire framework is released to accelerate continuous, resource-efficient quality assurance across the generative-AI ecosystem.
摘要
Tiny QA Benchmark++(TQB++)提出了一种超轻量级、多语言的冒烟测试套件,旨在为大型语言模型(LLM)流程提供一个单元测试风格的安全网数据集,该数据集可在数秒内以极低成本运行。该工具源于构建Comet Opik提示优化SDK时对紧密反馈循环的需求,因为在开发过程中等待重量级基准测试会中断开发流程。TQB++将包含52个项目的英语黄金数据集(小于20 kB)与一个基于与提供商无关的LiteLLM构建的微型合成数据生成器PyPI包相结合。该生成器允许从业者以任何语言、领域或难度创建自己的微型数据集包,同时已提供的十个现成包覆盖了阿拉伯语、中文、法语、德语、日语、韩语、葡萄牙语、俄语、西班牙语和土耳其语。每个数据集均附带Croissant元数据以及即插即用文件,支持OpenAI-Evals、LangChain和标准CI工具,使团队能够将确定性微基准测试直接集成到拉取请求门控、提示工程循环和生产仪表板中,而无需触及GPU预算。完整的TQB++运行仅增加几秒的流程延迟,却能可靠地标记出提示模板错误、分词器漂移和微调副作用,其速度远超MMLU或BIG-Bench等全规模测试套件的配置时间。整个框架的发布旨在加速生成式AI生态系统中持续且资源高效的质量保障。
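The pull-request gate described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the item fields (question, answer), the four made-up items in TINY_PACK, the exact-match scoring, and the stub_model function are hypothetical and are not taken from the TQB++ gold set, its generator package, or any real provider API.

```python
# Hypothetical smoke-test gate in the spirit of a tiny QA benchmark.
# Item schema, data, and threshold are illustrative assumptions.

def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.strip().lower().split())


def exact_match_accuracy(items, answer_fn):
    """Fraction of items whose model answer exactly matches the gold answer."""
    if not items:
        raise ValueError("items must not be empty")
    hits = sum(
        normalize(answer_fn(item["question"])) == normalize(item["answer"])
        for item in items
    )
    return hits / len(items)


def smoke_test(items, answer_fn, threshold=0.8):
    """Return (passed, accuracy); a CI job could fail the build when passed is False."""
    accuracy = exact_match_accuracy(items, answer_fn)
    return accuracy >= threshold, accuracy


# Made-up items for demonstration only.
TINY_PACK = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What color do blue and yellow make?", "answer": "green"},
    {"question": "How many days are in a week?", "answer": "7"},
]


def stub_model(question):
    """Stand-in for a real LLM call; answers three of the four items correctly."""
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "  paris ",
        "What color do blue and yellow make?": "Green",
        "How many days are in a week?": "seven",
    }
    return canned.get(question, "")


if __name__ == "__main__":
    passed, accuracy = smoke_test(TINY_PACK, stub_model, threshold=0.8)
    print(f"accuracy={accuracy:.2f} passed={passed}")
```

In a real pipeline, stub_model would be replaced by a call to the model under test. The value of such a gate comes from the set being small and deterministic, so a sudden drop points at a prompt-template or tokenizer regression rather than at noise.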
CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction
Abstract
arXiv:2505.12057v1 Announce Type: new Abstract: AI-driven models have shown great promise in detecting errors in radiology reports, yet the field lacks a unified benchmark for rigorous evaluation of error detection and further correction. To address this gap, we introduce CorBenchX, a comprehensive suite for automated error detection and correction in chest X-ray reports, designed to advance AI-assisted quality control in clinical practice. We first synthesize a large-scale dataset of 26,326 chest X-ray error reports by injecting clinically common errors via prompting DeepSeek-R1, with each corrupted report paired with its original text, error type, and human-readable description. Leveraging this dataset, we benchmark both open- and closed-source vision-language models (e.g., InternVL, Qwen-VL, GPT-4o, o4-mini, and Claude-3.7) for error detection and correction under zero-shot prompting. Among these models, o4-mini achieves the best performance, with 50.6% detection accuracy and correction scores of BLEU 0.853, ROUGE 0.924, BERTScore 0.981, SembScore 0.865, and CheXbertF1 0.954, remaining below clinical-level accuracy, highlighting the challenge of precise report correction. To advance the state of the art, we propose a multi-step reinforcement learning (MSRL) framework that optimizes a multi-objective reward combining format compliance, error-type accuracy, and BLEU similarity. We apply MSRL to QwenVL2.5-7B, the top open-source model in our benchmark, achieving an improvement of 38.3% in single-error detection precision and 5.2% in single-error correction over the zero-shot baseline.
摘要
人工智能驱动模型在放射学报告错误检测方面展现出巨大潜力,但该领域目前缺乏统一的基准来严格评估错误检测及后续修正能力。为填补这一空白,我们推出CorBenchX——一个用于胸片报告自动错误检测与修正的综合测试平台,旨在推进临床实践中AI辅助的质量控制。我们首先通过提示DeepSeek-R1注入临床常见错误,合成了包含26,326份胸片错误报告的大规模数据集,每份错误报告均配有原始文本、错误类型及人工可读描述。基于该数据集,我们对开源和闭源视觉语言模型(如InternVL、Qwen-VL、GPT-4o、o4-mini和Claude-3.7)进行零样本提示下的错误检测与修正基准测试。其中o4-mini表现最佳,检测准确率达50.6%,修正评分为BLEU 0.853、ROUGE 0.924、BERTScore 0.981、SembScore 0.865和CheXbertF1 0.954,但仍未达到临床级精度,凸显了精确报告修正的挑战性。为推进技术发展,我们提出多步强化学习(MSRL)框架,通过优化格式合规性、错误类型准确率和BLEU相似度的多目标奖励函数。将该框架应用于基准测试中表现最佳的开源模型QwenVL2.5-7B后,单错误检测精度提升38.3%,单错误修正率较零样本基线提高5.2%。
BeliefNest: A Joint Action Simulator for Embodied Agents with Theory of Mind
Abstract
arXiv:2505.12321v1 Announce Type: new Abstract: This paper introduces an open-source simulator, BeliefNest, designed to enable embodied agents to perform collaborative tasks by leveraging Theory of Mind. BeliefNest dynamically and hierarchically constructs simulators within a Minecraft environment, allowing agents to explicitly represent nested belief states about themselves and others. This enables agent control in open-domain tasks that require Theory of Mind reasoning. The simulator provides a prompt generation mechanism based on each belief state, facilitating the design and evaluation of methods for agent control utilizing large language models (LLMs). We demonstrate through experiments that agents can infer others' beliefs and predict their belief-based actions in false-belief tasks.
摘要
本文介绍了一款开源模拟器BeliefNest,旨在通过心智理论实现具身智能体执行协作任务。该模拟器在Minecraft环境中动态分层构建仿真框架,使智能体能够显式表征自我与他人的嵌套信念状态,从而支持需要心智理论推理的开放域任务中的智能体控制。该模拟器提供基于各信念状态的提示生成机制,便于利用大语言模型(LLMs)进行智能体控制方法的设计与评估。实验表明,智能体在错误信念任务中能够推断他人信念并预测其基于信念的行为。
LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs
Abstract
arXiv:2505.12135v1 Announce Type: new Abstract: Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce LLM-BabyBench, a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state (Predict task), (2) generating sequences of low-level actions to achieve specified objectives (Plan task), and (3) decomposing high-level instructions into coherent subgoal sequences (Decompose task). We detail the methodology for generating the three corresponding datasets (LLM-BabyBench-Predict, -Plan, -Decompose) by extracting structured information from an expert agent operating within the text-based environment. Furthermore, we provide a standardized evaluation harness and metrics, including environment interaction for validating generated plans, to facilitate reproducible assessment of diverse LLMs. Initial baseline results highlight the challenges posed by these grounded reasoning tasks. The benchmark suite, datasets, data generation code, and evaluation code are made publicly available (GitHub: https://github.com/choukrani/llm-babybench, HuggingFace: https://huggingface.co/datasets/salem-mbzuai/LLM-BabyBench).
摘要
评估大型语言模型(LLM)在交互环境约束下进行规划和推理的能力,对于开发强大的人工智能代理至关重要。为此,我们推出专为这一目标设计的全新基准测试套件LLM-BabyBench。该套件基于文本化改编的程序化生成BabyAI网格世界构建,从具身智能的三个基础维度评估LLM:(1) 预测行为对环境状态的影响(Predict任务),(2) 生成实现特定目标的底层动作序列(Plan任务),(3) 将高层指令分解为连贯的子目标序列(Decompose任务)。我们详细阐述了通过从文本环境中运行的专家代理提取结构化信息,生成三个对应数据集(LLM-BabyBench-Predict、-Plan、-Decompose)的方法论,并提供了标准化评估框架与指标(包括用于验证生成计划的环境交互机制),以促进不同LLM的可复现评估。初始基线结果凸显了这些具身推理任务带来的挑战。该基准套件、数据集、数据生成代码及评估代码均已开源(GitHub:https://github.com/choukrani/llm-babybench,HuggingFace:https://huggingface.co/datasets/salem-mbzuai/LLM-BabyBench)。
ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
Abstract
arXiv:2505.12242v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) often exceeds GPU memory limits, prompting systems to offload model states to CPU memory. However, existing offloaded training frameworks like ZeRO-Offload treat all parameters equally and update the full model on the CPU, causing severe GPU stalls, where fast, expensive GPUs sit idle waiting for slow CPU updates and limited-bandwidth PCIe transfers. We present ZenFlow, a new offloading framework that prioritizes important parameters and decouples updates between GPU and CPU. ZenFlow performs in-place updates of important gradients on GPU, while asynchronously offloading and accumulating less important ones on CPU, fully overlapping CPU work with GPU computation. To scale across GPUs, ZenFlow introduces a lightweight gradient selection method that exploits a novel spatial and temporal locality property of important gradients, avoiding costly global synchronization. ZenFlow achieves up to 5x end-to-end speedup, 2x lower PCIe traffic, and reduces GPU stalls by over 85 percent, all while preserving accuracy.
摘要
微调大型语言模型(LLM)常超出GPU内存限制,促使系统将模型状态卸载至CPU内存。然而现有卸载训练框架(如ZeRO-Offload)均等对待所有参数并在CPU上更新完整模型,导致严重的GPU停滞——高速昂贵的GPU因等待低速CPU更新和有限带宽的PCIe传输而闲置。我们提出ZenFlow框架,通过优先级划分实现参数差异化处理,并解耦GPU与CPU的更新过程。该框架在GPU上原位更新重要梯度,同时将次要梯度异步卸载至CPU进行累积,实现CPU工作与GPU计算的完全重叠。为支持多GPU扩展,ZenFlow引入轻量级梯度选择方法,利用重要梯度特有的时空局部性特性,避免昂贵的全局同步。实验表明,ZenFlow在保持精度的前提下,可实现最高5倍的端到端加速,降低50%的PCIe传输流量,并将GPU停滞减少85%以上。
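The idea of updating important gradients immediately while accumulating the rest can be sketched in a few lines. This is a single-process toy under stated assumptions, not ZenFlow's implementation: importance is assumed to be plain gradient magnitude, the deferred path is simulated synchronously rather than on a separate CPU thread, and the names split_by_importance, train_step, top_fraction, and flush_every are hypothetical.

```python
# Toy illustration of decoupled updates: apply the largest gradients at once,
# accumulate the rest, and apply the accumulated sum every few steps.
# Importance = |gradient| is an assumption made for this sketch.

def split_by_importance(magnitudes, top_fraction=0.25):
    """Return (important, deferred) index lists, ranked by magnitude."""
    k = max(1, int(len(magnitudes) * top_fraction))
    order = sorted(range(len(magnitudes)), key=lambda i: magnitudes[i], reverse=True)
    return sorted(order[:k]), sorted(order[k:])


def train_step(params, grads, accum, step, lr=0.1, top_fraction=0.25, flush_every=2):
    """Mutate params and accum in place; return the indices updated immediately."""
    important, deferred = split_by_importance([abs(g) for g in grads], top_fraction)

    # Fast path: important entries are updated right away (stands in for the GPU).
    for i in important:
        params[i] -= lr * grads[i]

    # Slow path: remaining gradients are only accumulated (stands in for the CPU).
    for i in deferred:
        accum[i] += grads[i]

    # Every flush_every steps, apply the accumulated gradients and reset the buffer.
    if (step + 1) % flush_every == 0:
        for i in range(len(params)):
            params[i] -= lr * accum[i]
            accum[i] = 0.0
    return important


if __name__ == "__main__":
    params = [1.0, 1.0, 1.0, 1.0]
    accum = [0.0, 0.0, 0.0, 0.0]
    grads = [0.5, -2.0, 0.1, 0.3]
    for step in range(2):
        chosen = train_step(params, grads, accum, step)
        print(step, chosen, params)
```

In a real offloading system the slow path would run asynchronously so that it overlaps with GPU computation; the sketch only shows the bookkeeping that keeps both paths consistent.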
Beyond Single-Point Judgment: Distribution Alignment for LLM-as-a-Judge
Abstract
arXiv:2505.12301v1 Announce Type: new Abstract: LLMs have emerged as powerful evaluators in the LLM-as-a-Judge paradigm, offering significant efficiency and flexibility compared to human judgments. However, previous methods primarily rely on single-point evaluations, overlooking the inherent diversity and uncertainty in human evaluations. This approach leads to information loss and decreases the reliability of evaluations. To address this limitation, we propose a novel training framework that explicitly aligns the LLM-generated judgment distribution with empirical human distributions. Specifically, we propose a distributional alignment objective based on KL divergence, combined with an auxiliary cross-entropy regularization to stabilize the training process. Furthermore, considering that empirical distributions may derive from limited human annotations, we incorporate adversarial training to enhance model robustness against distribution perturbations. Extensive experiments across various LLM backbones and evaluation tasks demonstrate that our framework significantly outperforms existing closed-source LLMs and conventional single-point alignment methods, with improved alignment quality, evaluation accuracy, and robustness.
摘要
在"LLM即评委"范式下,大语言模型(LLM)已成为强大的评估工具,相较于人工评判展现出显著的效率与灵活性优势。然而既有方法主要依赖单点评估,忽视了人类评估固有的多样性与不确定性,导致信息丢失并降低评估可靠性。为解决这一局限,我们提出了一种新颖的训练框架,通过显式对齐LLM生成的判断分布与经验性人类分布来实现优化。具体而言,我们设计了基于KL散度的分布对齐目标函数,并结合辅助交叉熵正则化以稳定训练过程。进一步考虑到经验分布可能源自有限的人工标注数据,我们引入对抗训练以增强模型对分布扰动的鲁棒性。跨多种LLM主干模型和评估任务的大规模实验表明,本框架显著优于现有闭源LLM和传统单点对齐方法,在对齐质量、评估准确性和鲁棒性方面均有提升。
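A distribution-alignment objective of the kind described above (a KL term plus an auxiliary cross-entropy term) can be written out as a small sketch. The details are assumptions, since the abstract does not specify them: the KL direction is taken as KL(human || model), the cross-entropy target is taken to be the majority human label, and ce_weight is a hypothetical mixing coefficient. The adversarial-training component is not shown.

```python
import math

# Sketch of a distribution-alignment loss over discrete judgment scores.
# KL direction, CE target, and ce_weight are assumptions for illustration.

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same score bins."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(target_index, q, eps=1e-12):
    """Negative log-probability the model assigns to one target bin."""
    return -math.log(max(q[target_index], eps))


def alignment_loss(human_dist, model_dist, ce_weight=0.1):
    """KL to the empirical human distribution plus a weighted CE on the majority label."""
    majority = max(range(len(human_dist)), key=human_dist.__getitem__)
    return (
        kl_divergence(human_dist, model_dist)
        + ce_weight * cross_entropy(majority, model_dist)
    )


if __name__ == "__main__":
    human = [0.1, 0.6, 0.3]   # e.g. share of annotators choosing each score
    model = [0.2, 0.5, 0.3]   # model's predicted probability for each score
    print(alignment_loss(human, human))
    print(alignment_loss(human, model))
```

A single-point method would train only on the majority label, which corresponds to keeping just the cross-entropy term; the KL term is what lets the model reproduce the spread of human judgments rather than only their mode.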